Native Language Identification Using a Mixture of Character and Word N-grams
Abstract
Native language identification (NLI) is the task of determining an author’s native language from a piece of their writing in a second language. In recent years, NLI has received much attention because of its challenging nature and its applications in language pedagogy and forensic linguistics. We participated in the NLI Shared Task 2017 under the name UT-DSP. Our method uses a mixture of character and word n-grams and achieved a best F1-score of 0.7748 using both the essay and speech transcription datasets.
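As a rough illustration of what a mixture of character and word n-grams can look like as a feature set, the sketch below counts both kinds of n-gram in one bag of features. It is only a minimal sketch under stated assumptions: the abstract does not give the n-gram orders, tokenization, weighting, or classifier used by UT-DSP, so the function names, the whitespace tokenizer, the lowercasing, and the default orders here are all hypothetical choices made for the example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mixed_features(text, char_orders=(2, 3), word_orders=(1, 2)):
    """Count character and word n-grams in a single feature bag.

    The orders are illustrative defaults, not values from the paper.
    Each feature is keyed by (kind, n-gram) so that a character n-gram
    and a word n-gram with the same spelling stay distinct.
    """
    features = Counter()
    lowered = text.lower()
    for n in char_orders:
        features.update(("char", g) for g in char_ngrams(lowered, n))
    for n in word_orders:
        features.update(("word", g) for g in word_ngrams(text, n))
    return features


feats = mixed_features("I am agree with you")
```

A feature bag like `feats` could then be turned into a vector and passed to any standard classifier; which classifier and weighting scheme the authors used is not stated in the abstract.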
Similar resources
CIC-FBK Approach to Native Language Identification
We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are novel in this task, such as typed character n-gram...
Native Language Identification using large scale lexical features
This paper describes an effort to perform Native Language Identification (NLI) using machine learning on a large number of lexical features. The features were collected from sequences and collocations of bare word forms, suffixes, and character n-grams, amounting to a feature set of several hundred thousand features. These features were used to train a linear Support Vector Machine (SVM) classifi...
VTEX System Description for the NLI 2013 Shared Task
This paper describes the system developed for the NLI 2013 Shared Task, which requires identifying a writer’s native language from a text written in English. I explore the given manually annotated data using word features such as length, endings, and character trigrams. Furthermore, I employ k-NN classification. A modified TF-IDF is used to generate a stop-word list automatically. The distance betw...
A study of N-gram and Embedding Representations for Native Language Identification
We report on our experiments with n-gram and embedding-based feature representations for Native Language Identification (NLI) as part of the NLI Shared Task 2017 (team name: NLI-ISU). Our best-performing system on the test set for written essays had a macro F1 of 0.8264 and was based on word uni-, bi-, and trigram features. We explored n-grams covering word, character, POS and word-POS mixed repr...
Native Language Identification using Phonetic Algorithms
In this paper, we discuss the results of the IUCL system in the NLI Shared Task 2017. For our system, we explore a variety of phonetic algorithms to generate features for Native Language Identification. These features are contrasted with one of the most successful types of features in NLI, character n-grams. We find that although phonetic features do not perform as well as character n-grams alon...